(53940, 10)
January 25, 2026
Take a single variable in a dataset.
Visualize to learn more about it. Key cases of interest:

* Categorical variables (in lecture 2 we called these N/O)
* Continuous variables (in lecture 2 we called these Q)
* Exploring typical values
* Exploring and dealing with unusual values
(Up next class: covariation with two variables)
Data visualization has two distinct goals
“Make dozens of plots”
Quoctrung Bui, former 30538 guest lecturer and former Harris data viz instructor
What does he mean?
diamonds and mpg are from “Exploratory Data Analysis”; movies (also used last lecture) is from the UW textbook; penguins is from “Data Visualization”.

diamonds

| | carat | cut | color | clarity | depth | table | price | x | y | z |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.23 | Ideal | E | SI2 | 61.5 | 55.0 | 326 | 3.95 | 3.98 | 2.43 |
| 1 | 0.21 | Premium | E | SI1 | 59.8 | 61.0 | 326 | 3.89 | 3.84 | 2.31 |
| 2 | 0.23 | Good | E | VS1 | 56.9 | 65.0 | 327 | 4.05 | 4.07 | 2.31 |
| 3 | 0.29 | Premium | I | VS2 | 62.4 | 58.0 | 334 | 4.20 | 4.23 | 2.63 |
| 4 | 0.31 | Good | J | SI2 | 63.3 | 58.0 | 335 | 4.34 | 4.35 | 2.75 |
diamonds data dictionary

| Variable | Definition | Values |
|---|---|---|
| price | price in USD | $326–$18,823 |
| carat | weight of diamond | 0.2–5.01 |
| cut | quality of the cut | Fair, Good, Very Good, Premium, Ideal |
| color | diamond color | D (best) to J (worst) |
| clarity | measure of how clear diamond is | I1 (worst), SI2, SI1, VS2, VS1, VVS2, VVS1, IF (best) |
| x | length in mm | 0–10.74 |
| y | width in mm | 0–58.9 |
| z | depth in mm | 0–31.8 |
| depth | \(\frac{z}{\mathrm{mean}(x, y)} \times 100\) | 43–79 |
| table | width of top of diamond relative to widest point | 43–95 |
cut in a table

cut
Fair 1610
Good 4906
Very Good 12082
Premium 13791
Ideal 21551
dtype: int64
# count diamonds in each cut category, then move the index (`cut`)
# into a normal column so Altair can read it
diamonds_cut = diamonds['cut'].value_counts().reset_index()
diamonds_cut.columns = ['cut', 'N diamonds']
# extract ordering of `cut` as a Python list to use for sorting
cut_order = diamonds['cut'].cat.categories.tolist()
alt.Chart(diamonds_cut).mark_bar().encode(
alt.X('cut:O', title = "Cut", sort=cut_order),
alt.Y('N diamonds:Q', title = "Count")
).properties(width=640, height=360).configure_axis(
labelFontSize=18,
titleFontSize=20
)

Note: we have included syntax to modify graph properties here. Going forward our .qmd source code uses these throughout, but we will omit it in the slides for the sake of space.
You could use mark_point() instead of mark_bar(), but overall, there’s a clear right answer about how to do this.

Remark: these skills are absolutely fundamental, so we will intentionally be a bit repetitive across the movies, penguins, and diamonds datasets.
movies dataset

movies_url = 'https://cdn.jsdelivr.net/npm/vega-datasets@1/data/movies.json'
movies = pd.read_json(movies_url)
movies.head()

| | Title | US_Gross | Worldwide_Gross | US_DVD_Sales | Production_Budget | Release_Date | MPAA_Rating | Running_Time_min | Distributor | Source | Major_Genre | Creative_Type | Director | Rotten_Tomatoes_Rating | IMDB_Rating | IMDB_Votes |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | The Land Girls | 146083.0 | 146083.0 | NaN | 8000000.0 | Jun 12 1998 | R | NaN | Gramercy | None | None | None | None | NaN | 6.1 | 1071.0 |
| 1 | First Love, Last Rites | 10876.0 | 10876.0 | NaN | 300000.0 | Aug 07 1998 | R | NaN | Strand | None | Drama | None | None | NaN | 6.9 | 207.0 |
| 2 | I Married a Strange Person | 203134.0 | 203134.0 | NaN | 250000.0 | Aug 28 1998 | None | NaN | Lionsgate | None | Comedy | None | None | NaN | 6.8 | 865.0 |
| 3 | Let's Talk About Sex | 373615.0 | 373615.0 | NaN | 300000.0 | Sep 11 1998 | None | NaN | Fine Line | None | Comedy | None | None | 13.0 | NaN | NaN |
| 4 | Slam | 1009819.0 | 1087521.0 | NaN | 1000000.0 | Oct 09 1998 | R | NaN | Trimark | Original Screenplay | Drama | Contemporary Fiction | None | 62.0 | 3.4 | 165.0 |
Histogram with mark_bar()

hist_rt = alt.Chart(movies_url).mark_bar().encode(
alt.X('Rotten_Tomatoes_Rating:Q', bin=alt.BinParams(maxbins=20), title = "Rotten Tomatoes Rating (%)"),
alt.Y('count():Q', title = "Count")
)
hist_rt

Discussion question: what are the headline and sub-messages?
IMDB ratings are formed by averaging scores (ranging from 1 to 10) provided by the site’s users.
hist_imdb = alt.Chart(movies_url).mark_bar().encode(
alt.X('IMDB_Rating:Q', bin=alt.BinParams(maxbins=20), title = "IMDB Ratings"),
alt.Y('count():Q', title = "Count")
)
hist_imdb

Discussion question: compare the two ratings distributions. If your goal for the headline of the graph is about differentiating between good and bad movies, which rating is more informative?
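One way to ground that comparison: a rating that spreads scores widely differentiates movies more than one that clusters in a narrow band. A minimal sketch with synthetic stand-ins (not the real ratings data; the distributions below are assumptions for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-ins (assumptions, not the real data):
# RT-like scores spread roughly evenly over 0-100;
# IMDB-like scores cluster around a middle value.
rt_like = rng.uniform(0, 100, size=10_000)
imdb_like = np.clip(rng.normal(6.2, 1.0, size=10_000), 1, 10) * 10  # rescale to 0-100

# a wider spread separates good movies from bad ones more sharply
print(f"RT-like SD:   {rt_like.std():.1f}")
print(f"IMDB-like SD: {imdb_like.std():.1f}")
```

The standard deviation is only one lens; the histograms above show the full shape of each distribution.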
penguins dataset

url = ("https://raw.githubusercontent.com/mcnakhaee/palmerpenguins/master/palmerpenguins/data/penguins.csv")
penguins = pd.read_csv(url)
penguins.head()

| | species | island | bill_length_mm | bill_depth_mm | flipper_length_mm | body_mass_g | sex | year |
|---|---|---|---|---|---|---|---|---|
| 0 | Adelie | Torgersen | 39.1 | 18.7 | 181.0 | 3750.0 | male | 2007 |
| 1 | Adelie | Torgersen | 39.5 | 17.4 | 186.0 | 3800.0 | female | 2007 |
| 2 | Adelie | Torgersen | 40.3 | 18.0 | 195.0 | 3250.0 | female | 2007 |
| 3 | Adelie | Torgersen | NaN | NaN | NaN | NaN | NaN | 2007 |
| 4 | Adelie | Torgersen | 36.7 | 19.3 | 193.0 | 3450.0 | female | 2007 |
So far we passed bin=alt.BinParams(maxbins=20) and let Altair choose “nice”-looking bin widths for the histogram. An alternative is to set the bin width directly with step:

alt.Chart(penguins).mark_bar().encode(
alt.X('body_mass_g:Q', bin=alt.BinParams(step=200), title = "Body Mass (g)"),
alt.Y('count():Q', title = "Count")
)

The step parameter: step=20 vs. step=200 vs. step=2000
Discussion question: what headline message(s) come from each step choice? Which do you prefer?
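If you want a starting point rather than eyeballing step values, standard bin-width rules can suggest one. A sketch on synthetic data (the gamma distribution is an assumed stand-in for a skewed size variable like carat):

```python
import numpy as np

rng = np.random.default_rng(0)
# synthetic stand-in for a right-skewed size variable (an assumption)
sizes = rng.gamma(shape=2.0, scale=0.4, size=5_000)

# numpy suggests bin edges via standard rules, e.g. Freedman-Diaconis ("fd")
edges = np.histogram_bin_edges(sizes, bins="fd")
step = edges[1] - edges[0]
print(f"suggested step: {step:.3f} over {len(edges) - 1} bins")
```

A rule-based step is a default, not an answer: the lecture's point is that different steps surface different headline messages, so iterate around the suggestion.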
An alternative to a histogram for exploring the frequency of a continuous variable: a density plot using transform_density
alt.Chart(penguins).transform_density(
'body_mass_g',
as_=['body_mass_g2', 'density']
).mark_area().encode(
alt.X('body_mass_g2:Q', title = "Body Mass (g)"),
alt.Y('density:Q', title = "Density")
)

carat

alt.data_transformers.disable_max_rows() # disable 5k max rows
alt.Chart(diamonds).mark_bar().encode(
alt.X('carat', bin=alt.Bin(maxbins=10), title = "Carat"),
alt.Y('count()', title = "Count")
)

First plot iteration reveals most of the sample is < 2 carats
carat

diamonds_small = diamonds.loc[diamonds['carat'] < 2.1]
alt.Chart(diamonds_small).mark_bar().encode(
alt.X('carat', bin=alt.BinParams(step=0.2), title = "Carat"),
alt.Y('count()', title = "Count")
)

Second plot iteration reveals the count is not monotonically decreasing in carat
carat

alt.Chart(diamonds_small).mark_bar().encode(
alt.X('carat', bin=alt.BinParams(step=0.02), title = "Carat"),
alt.Y('count()', title = "Count"))

Discussion questions
y dimension in diamonds
diamonds: identify unusual y values

First pass to examine for unusual values: summary statistics
count 53940.000000
mean 5.734526
std 1.142135
min 0.000000
25% 4.720000
50% 5.710000
75% 6.540000
max 58.900000
Name: y, dtype: float64
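The summary above is standard pandas describe() output. A minimal sketch of the pattern on a synthetic stand-in for the y column (hand-picked values, not the full dataset):

```python
import pandas as pd

# synthetic stand-in for diamonds['y']: mostly plausible widths (mm)
# plus two impossible values planted to mimic the real data's problems
y = pd.Series([3.98, 3.84, 4.07, 4.23, 4.35, 5.71, 6.54, 0.0, 58.9])

# describe() surfaces min/max, so a 0 mm or 58.9 mm "width" jumps out
print(y.describe())
```

The point of the first pass is exactly this: min and max bracket the data, so physically impossible values show up before any plotting.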
diamonds: examine unusual y values

| | carat | cut | color | clarity | depth | table | price | x | y | z |
|---|---|---|---|---|---|---|---|---|---|---|
| 11963 | 1.00 | Very Good | H | VS2 | 63.3 | 53.0 | 5139 | 0.0 | 0.0 | 0.0 |
| 15951 | 1.14 | Fair | G | VS1 | 57.5 | 67.0 | 6381 | 0.0 | 0.0 | 0.0 |
| 24520 | 1.56 | Ideal | G | VS2 | 62.2 | 54.0 | 12800 | 0.0 | 0.0 | 0.0 |
| 26243 | 1.20 | Premium | D | VVS1 | 62.1 | 59.0 | 15686 | 0.0 | 0.0 | 0.0 |
| 27429 | 2.25 | Premium | H | SI2 | 62.8 | 59.0 | 18034 | 0.0 | 0.0 | 0.0 |
| 49556 | 0.71 | Good | F | SI2 | 64.1 | 60.0 | 2130 | 0.0 | 0.0 | 0.0 |
| 49557 | 0.71 | Good | F | SI2 | 64.1 | 60.0 | 2130 | 0.0 | 0.0 | 0.0 |
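Rows like the ones above come from a boolean filter on y. A sketch of the pattern on a toy frame (the cutoffs 3 and 20 match the recoding used later in this lecture):

```python
import pandas as pd

df = pd.DataFrame({
    'carat': [0.23, 1.00, 2.00, 0.51],
    'y':     [3.98, 0.00, 58.90, 5.15],  # 0 and 58.9 mm are implausible widths
})

# keep rows where y is suspiciously small or large
unusual = df[(df['y'] < 3) | (df['y'] > 20)]
print(unusual)
```

Keeping the whole row (not just the y column) is deliberate: the other columns give the context needed to judge whether the value is an error.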
diamonds: compare to 10 random obs

| | carat | cut | color | clarity | depth | table | price | x | y | z |
|---|---|---|---|---|---|---|---|---|---|---|
| 46051 | 0.54 | Ideal | E | VS1 | 62.6 | 55.0 | 1732 | 5.20 | 5.22 | 3.26 |
| 52198 | 0.70 | Premium | G | VS2 | 61.8 | 60.0 | 2479 | 5.67 | 5.63 | 3.49 |
| 15736 | 1.01 | Ideal | G | VS2 | 62.0 | 54.9 | 6295 | 6.41 | 6.44 | 3.99 |
| 39959 | 0.33 | Premium | D | SI2 | 61.6 | 59.0 | 492 | 4.41 | 4.42 | 2.72 |
| 17562 | 1.00 | Premium | F | VS2 | 59.0 | 59.0 | 7072 | 6.54 | 6.51 | 3.85 |
| 34918 | 0.34 | Premium | F | VS2 | 61.2 | 59.0 | 880 | 4.52 | 4.47 | 2.75 |
| 31252 | 0.46 | Ideal | F | SI2 | 61.5 | 54.0 | 758 | 4.98 | 5.01 | 3.07 |
| 47113 | 0.65 | Very Good | I | VVS2 | 61.3 | 59.3 | 1828 | 5.52 | 5.59 | 3.40 |
| 48830 | 0.70 | Good | H | SI1 | 63.3 | 60.0 | 2029 | 5.57 | 5.65 | 3.55 |
| 3071 | 0.80 | Premium | D | SI1 | 61.7 | 58.0 | 3312 | 5.96 | 5.93 | 3.67 |
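A random comparison sample like the one above can be drawn with DataFrame.sample; a minimal sketch (random_state is an assumption added for reproducibility):

```python
import pandas as pd

# toy frame standing in for diamonds
df = pd.DataFrame({'y': range(100)})

# draw 10 random rows to eyeball what "normal" values look like
comparison = df.sample(n=10, random_state=42)
print(comparison)
```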
Recoding unusual values to NA

| | carat | cut | color | clarity | depth | table | price | x | y | z |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.23 | Ideal | E | SI2 | 61.5 | 55.0 | 326 | 3.95 | 3.98 | 2.43 |
| 1 | 0.21 | Premium | E | SI1 | 59.8 | 61.0 | 326 | 3.89 | 3.84 | 2.31 |
| 2 | 0.23 | Good | E | VS1 | 56.9 | 65.0 | 327 | 4.05 | 4.07 | 2.31 |
| 3 | 0.29 | Premium | I | VS2 | 62.4 | 58.0 | 334 | 4.20 | 4.23 | 2.63 |
| 4 | 0.31 | Good | J | SI2 | 63.3 | 58.0 | 335 | 4.34 | 4.35 | 2.75 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 53935 | 0.72 | Ideal | D | SI1 | 60.8 | 57.0 | 2757 | 5.75 | 5.76 | 3.50 |
| 53936 | 0.72 | Good | D | SI1 | 63.1 | 55.0 | 2757 | 5.69 | 5.75 | 3.61 |
| 53937 | 0.70 | Very Good | D | SI1 | 62.8 | 60.0 | 2757 | 5.66 | 5.68 | 3.56 |
| 53938 | 0.86 | Premium | H | SI2 | 61.0 | 58.0 | 2757 | 6.15 | 6.12 | 3.74 |
| 53939 | 0.75 | Ideal | D | SI2 | 62.2 | 55.0 | 2757 | 5.83 | 5.87 | 3.64 |
53940 rows × 10 columns
diamonds_missing = diamonds.copy()
diamonds_missing['y'] = np.where((diamonds_missing['y'] < 3) |
(diamonds_missing['y'] > 20),
np.nan, diamonds_missing['y'])
diamonds_missing[diamonds_missing['y'].isna()]

| | carat | cut | color | clarity | depth | table | price | x | y | z |
|---|---|---|---|---|---|---|---|---|---|---|
| 11963 | 1.00 | Very Good | H | VS2 | 63.3 | 53.0 | 5139 | 0.00 | NaN | 0.00 |
| 15951 | 1.14 | Fair | G | VS1 | 57.5 | 67.0 | 6381 | 0.00 | NaN | 0.00 |
| 24067 | 2.00 | Premium | H | SI2 | 58.9 | 57.0 | 12210 | 8.09 | NaN | 8.06 |
| 24520 | 1.56 | Ideal | G | VS2 | 62.2 | 54.0 | 12800 | 0.00 | NaN | 0.00 |
| 26243 | 1.20 | Premium | D | VVS1 | 62.1 | 59.0 | 15686 | 0.00 | NaN | 0.00 |
| 27429 | 2.25 | Premium | H | SI2 | 62.8 | 59.0 | 18034 | 0.00 | NaN | 0.00 |
| 49189 | 0.51 | Ideal | E | VS1 | 61.8 | 55.0 | 2075 | 5.15 | NaN | 5.12 |
| 49556 | 0.71 | Good | F | SI2 | 64.1 | 60.0 | 2130 | 0.00 | NaN | 0.00 |
| 49557 | 0.71 | Good | F | SI2 | 64.1 | 60.0 | 2130 | 0.00 | NaN | 0.00 |
Winsorizing re-codes outliers to a numeric value, keeping them in the data.
To winsorize at 1 percent:
diamonds_winsor = diamonds.copy()
pctile01 = diamonds_winsor['y'].quantile(0.01)
pctile99 = diamonds_winsor['y'].quantile(0.99)
print(f"1st Percentile: {pctile01}")
print(f"99th Percentile: {pctile99}")

1st Percentile: 4.04
99th Percentile: 8.34
diamonds_winsor['y_winsor'] = np.where(diamonds_winsor['y'] < pctile01, pctile01,
np.where(diamonds_winsor['y'] > pctile99, pctile99,
diamonds_winsor['y']))
diamonds_winsor

| | carat | cut | color | clarity | depth | table | price | x | y | z | y_winsor |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.23 | Ideal | E | SI2 | 61.5 | 55.0 | 326 | 3.95 | 3.98 | 2.43 | 4.04 |
| 1 | 0.21 | Premium | E | SI1 | 59.8 | 61.0 | 326 | 3.89 | 3.84 | 2.31 | 4.04 |
| 2 | 0.23 | Good | E | VS1 | 56.9 | 65.0 | 327 | 4.05 | 4.07 | 2.31 | 4.07 |
| 3 | 0.29 | Premium | I | VS2 | 62.4 | 58.0 | 334 | 4.20 | 4.23 | 2.63 | 4.23 |
| 4 | 0.31 | Good | J | SI2 | 63.3 | 58.0 | 335 | 4.34 | 4.35 | 2.75 | 4.35 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 53935 | 0.72 | Ideal | D | SI1 | 60.8 | 57.0 | 2757 | 5.75 | 5.76 | 3.50 | 5.76 |
| 53936 | 0.72 | Good | D | SI1 | 63.1 | 55.0 | 2757 | 5.69 | 5.75 | 3.61 | 5.75 |
| 53937 | 0.70 | Very Good | D | SI1 | 62.8 | 60.0 | 2757 | 5.66 | 5.68 | 3.56 | 5.68 |
| 53938 | 0.86 | Premium | H | SI2 | 61.0 | 58.0 | 2757 | 6.15 | 6.12 | 3.74 | 6.12 |
| 53939 | 0.75 | Ideal | D | SI2 | 62.2 | 55.0 | 2757 | 5.83 | 5.87 | 3.64 | 5.87 |
53940 rows × 11 columns
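The nested np.where above can also be written with pandas' clip, which caps values at the two percentiles in one call. A sketch on a toy series:

```python
import numpy as np
import pandas as pd

# toy stand-in for diamonds['y']
y = pd.Series([0.0, 3.98, 4.07, 5.71, 6.54, 58.9])
lo, hi = y.quantile(0.01), y.quantile(0.99)

# clip() is equivalent to the nested np.where winsorization:
# values below lo become lo, values above hi become hi
y_winsor = y.clip(lower=lo, upper=hi)
print(y_winsor)
```

Either spelling works; clip reads more directly as "winsorize between these bounds."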
An example from Earnings Instability paper by Ganong and coauthors.
The paper is trying to quantify how much earnings change from month to month for the typical US worker.
Consider the following fake data (next slide)
Suppose we have observations of month-to-month earnings changes: 99% of the data follows a normal distribution with a standard deviation of 0.2, and 1% of the data is extremely large changes.
| last month ($) | this month ($) | % change | abs(% change) |
|---|---|---|---|
| 600 | 600 | 0% | 0% |
| 600 | 570 | -5% | 5% |
| 600 | 540 | -10% | 10% |
| 600 | 630 | 5% | 5% |
| … (99% of sample) … | | | |
| 600 | 300 | -50% | 50% |
| 6000 | 300 | -95% | 95% |
| 300 | 600 | 100% | 100% |
| 300 | 6000 | 1900% | 1900% |
What is the standard deviation of the % change in earnings?
| assumption | SD |
|---|---|
| do not winsorize | 97.2% |
| winsorize at 50% | 20.5% |
Illustrative calculation here
When else is this useful? Income data, test scores, stock returns.
Source: Table A-2 from Earnings Instability paper
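The direction of the effect in the table can be reproduced with a quick simulation under the stated assumptions (99% of changes normal with SD 0.2, 1% extreme; the outlier distribution below is an assumption, so the exact numbers differ from the paper's):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

# 99% of earnings changes: normal with SD 0.2
changes = rng.normal(0, 0.2, size=n)
# 1%: extreme positive changes, e.g. up to +1900% (assumed distribution)
outliers = rng.choice(n, size=n // 100, replace=False)
changes[outliers] = rng.uniform(1.0, 19.0, size=n // 100)

sd_raw = changes.std()
# winsorize at 50%: cap absolute changes at 0.5
sd_winsor = np.clip(changes, -0.5, 0.5).std()
print(f"SD without winsorizing: {sd_raw:.1%}")
print(f"SD winsorized at 50%:   {sd_winsor:.1%}")
```

A 1% sliver of extreme values dominates the raw standard deviation; winsorizing recovers a spread close to the 20% that describes the typical worker.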
diamonds: what would you do?

* x, y, and z are all 0?
* y > 20?

diamonds: what would we do?

There is often not a “right” answer, or you won’t know the answer without talking to a data provider.
Our best guesses:
* x, y, and z are all zero: set to NA
* y > 20: winsorize? (hard to know for sure…)

| Problem | Action |
|---|---|
| Erroneous row | drop row |
| Erroneous cell | set to NA or winsorize |
How do I decide which problem I have? Examine unusual values in the context of other columns (same row) and other rows (same column).
How do I decide whether to set to NA or winsorize? Ideally, ask your data provider what’s going on with these values.
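The two actions in the table look like this in pandas; a minimal sketch on a toy frame (column names are illustrative):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'carat': [0.23, 1.00, 0.51],
                   'y':     [3.98, 0.00, 58.90]})

# Erroneous row: drop it entirely
dropped = df.drop(index=1)

# Erroneous cell: keep the row, set just the bad value to NA
recoded = df.copy()
recoded.loc[2, 'y'] = np.nan
print(dropped)
print(recoded)
```

Setting a cell to NA preserves the rest of the row's information (here, carat), which is why it is usually preferable to dropping the row when only one field is suspect.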
Introduce mpg dataset
Research question 1
What is the relationship between engine size and gas mileage?
Research question 2
Why do some cars have better than typical mileage?
Ad hoc identification of outliers
Inspect fields describing outliers
Uncover pattern
mpg dataset

* manufacturer — car maker (e.g., toyota, ford)
* model — specific model name
* displ — engine size (liters)
* hwy — highway gas mileage (miles per gallon)
* class — vehicle class (compact, suv, pickup, etc.)

| | manufacturer | model | displ | hwy | class |
|---|---|---|---|---|---|
| 0 | audi | a4 | 1.8 | 29 | compact |
| 1 | audi | a4 | 1.8 | 29 | compact |
| 2 | audi | a4 | 2.0 | 31 | compact |
| 3 | audi | a4 | 2.0 | 30 | compact |
| 4 | audi | a4 | 2.8 | 26 | compact |
| ... | ... | ... | ... | ... | ... |
| 229 | volkswagen | passat | 2.0 | 28 | midsize |
| 230 | volkswagen | passat | 2.0 | 29 | midsize |
| 231 | volkswagen | passat | 2.8 | 26 | midsize |
| 232 | volkswagen | passat | 2.8 | 26 | midsize |
| 233 | volkswagen | passat | 3.6 | 26 | midsize |
234 rows × 5 columns
Discussion question – which fields do you want to study further on the plot (and why)?
['manufacturer', 'model', 'displ', 'year', 'cyl', 'trans', 'drv', 'cty', 'hwy', 'fl', 'class']
| | manufacturer | model | displ | year | cyl | trans | drv | cty | hwy | fl | class |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 23 | chevrolet | corvette | 5.7 | 1999 | 8 | manual(m6) | r | 16 | 26 | p | 2seater |
| 24 | chevrolet | corvette | 5.7 | 1999 | 8 | auto(l4) | r | 15 | 23 | p | 2seater |
| 25 | chevrolet | corvette | 6.2 | 2008 | 8 | manual(m6) | r | 16 | 26 | p | 2seater |
| 26 | chevrolet | corvette | 6.2 | 2008 | 8 | auto(s6) | r | 15 | 25 | p | 2seater |
| 27 | chevrolet | corvette | 7.0 | 2008 | 8 | manual(m6) | r | 15 | 24 | p | 2seater |
| 158 | pontiac | grand prix | 5.3 | 2008 | 8 | auto(s4) | f | 16 | 25 | p | midsize |
| 212 | volkswagen | jetta | 1.9 | 1999 | 4 | manual(m5) | f | 33 | 44 | d | compact |
| 221 | volkswagen | new beetle | 1.9 | 1999 | 4 | manual(m5) | f | 35 | 44 | d | subcompact |
| 222 | volkswagen | new beetle | 1.9 | 1999 | 4 | auto(l4) | f | 29 | 41 | d | subcompact |
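The outlier rows above were identified ad hoc from the scatter. One way to make that systematic is to flag cars whose hwy is far above the trend implied by displ; a sketch on synthetic data (the linear fit and the 3-SD threshold are assumptions, not the lecture's method):

```python
import numpy as np

rng = np.random.default_rng(0)

# synthetic engine sizes and mileage: mileage falls with engine size, plus noise
displ = rng.uniform(1.5, 7.0, size=200)
hwy = 40 - 3.5 * displ + rng.normal(0, 1.5, size=200)
hwy[:3] += 12  # plant a few unusually efficient "cars" as rows 0-2

# fit a linear trend and flag large positive residuals
slope, intercept = np.polyfit(displ, hwy, deg=1)
residuals = hwy - (intercept + slope * displ)
outliers = np.where(residuals > 3 * residuals.std())[0]
print(f"flagged rows: {outliers}")
```

Whichever way you flag them, the next step is the same as in the table above: inspect the flagged rows' other fields and look for a pattern.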
Fields: model, class

How did I know to use these? Context knowledge about different types of cars.
Don’t have context knowledge about your dataset? Use an LLM, Google, or a human subject matter expert to help you identify patterns.
* Color the plot by the model field
* class would capture what these models have in common
* Vehicle class (class) explains many of the outlier patterns in terms of fuel-efficient cars:
* class == “2seater”
* class == “subcompact” (but many subcompacts are less fuel efficient)

Research question: Why do some cars have better than typical mileage? (What’s going on with these outliers?)
What did we do?